Methods and computer program products for the quality control of nucleic acid assay

ABSTRACT

The disclosed invention provides methods and computer program products for the improved verification and controlling of assays for the analysis of nucleic acid variations by means of statistical process control. The invention is characterised in that variables of each experiment are monitored by measuring deviations of said variables from a reference data set and wherein said experiments or batches thereof are indicated as unsuitable for further interpretation if they exceed predetermined limits.

TECHNICAL FIELD

The field of the invention relates to methods and computer program products for the control of assays for the analysis of nucleic acid within DNA samples.

BACKGROUND ART

A fundamental goal of genomic research is the application of basic research into the sequence and functioning of the genome to improve healthcare and disease management. The application of novel disease or disease treatment markers to clinical and/or diagnostic settings often requires the adaptation of suitable research techniques to large scale high throughput formats. Such techniques include the use of large scale sequencing, mRNA analysis and in particular nucleic acid microarrays. DNA microarrays are one of the most popular technologies in molecular biology today. They are routinely used for the parallel observation of the mRNA expression of thousands of genes and have enabled the development of novel means of marker identification, tissue classification, and discovery of new tissue subtypes. Recently it has been shown that microarrays can also be used to detect DNA methylation and that results are comparable to mRNA expression analysis, see for example P. Adoian et al. Tumour class prediction and discovery by microarray-based DNA methylation analysis. Nucleic Acid Research, 30(5), 02. and T. Golub et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999.

Despite the popularity of microarray technology, there remain serious problems regarding measurement accuracy and reproducibility. Considerable effort has been put into the understanding and correction of effects such as background noise, signal noise on a slide and different dye efficiencies, see for example C. S. Brown et al. Image metrics in the statistical analysis of DNA microarray data. Proc Natl Acad Sci USA, 98(16):8944-8949, July 2001 and G. C. Tseng et al. Issues in cDNA microarray analysis: Quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Research, 29(12):2549-2557, 2001. However, with the exception of overall intensity normalization (A. Zien et al. Centralization: A new method for the normalization of gene expression data. Proc. ISMB '01/Bioinformatics, 17(6):323-331, 2001), it is not clear how to handle variations between single slides and systematic alterations between slide batches. Between-slide variations are particularly problematic because it is difficult to explicitly model the numerous different process factors which may distort the measurements. Some examples are the concentration and amount of spotted probe during array fabrication, the amount of labeled target added to the slide and the general conditions during hybridization. Other common but often neglected problems are handling errors such as accidental exchange of different probes during array fabrication. These effects can randomly affect single slides or whole slide batches. The latter is especially dangerous because it introduces a systematic error and can lead to false biological conclusions.

There are several ways to reduce between-slide variance and systematic errors. Removing obvious outlier chips based on visual inspection is an easy and effective way to increase experimental robustness. A more costly alternative is to do repeated chip experiments for every single biological sample and obtain a robust estimate for the average signal. With or without chip repetitions, randomized block design can further increase the certainty of biological findings. Unfortunately, there are several problems with this approach. Outliers cannot always be detected visually and it is not feasible to make enough chip repetitions to obtain a fully randomized block design for all potentially important process parameters. However, when experiments are standardized enough, process dependent alterations are relatively rare events. Therefore, instead of reducing these effects by repetitions, one should rather detect problematic slides or slide batches and repeat only those. This can only be achieved by controlling process stability.

Maintaining and controlling data quality is a key problem in high throughput analysis systems. The data quality is often hampered by experiment to experiment variability introduced by environmental conditions that may be difficult to control.

Examples of such variables include variability in sample preparation and uncontrollable reaction conditions. For example, in the case of microarray analysis, systematic changes in experimental conditions across multiple chips can seriously affect quality and even lead to false biological conclusions. Traditionally the influence of these effects has been minimized by expensive repeated measurements, because a detailed understanding of all process relevant parameters appears to be an unreasonable burden.

Process stability control is well known in many areas of industrial production where multivariate statistical process control (MVSPC) is used routinely to detect significant deviations from normal working conditions. The major tool of MVSPC is the T² control chart, which is a multivariate generalization of the popular univariate Shewhart control procedure.

See for example U.S. Pat. No. 5,693,440. In this application Hotelling's T² in combination with a simple PCA was used as a means of process verification in photographic processes. Although this application demonstrates the use of simple principal component analysis, the benefits of this are not obvious as the data set was not of a high dimensionality as is often encountered in biotechnological assays such as sequencing and microarray analysis. Furthermore, this application recommends the application of PCA on the “cleared” reference data set, which may hide variations caused by the data set to be monitored.

The application of MVSPC for statistical quality control of microarray and high throughput sequencing experiments is not straightforward. This is because most of the relevant process parameters of a microarray experiment cannot be measured routinely in a high throughput environment.

5-methylcytosine is the most frequent covalent base modification of the DNA of eukaryotic cells. Cytosine methylation only occurs in the context of CpG dinucleotides. It plays a role, for example, in the regulation of transcription, in genetic imprinting, and in tumorigenesis. Methylation is a particularly relevant layer of genomic information because it plays an important role in expression regulation (K. D. Robertson et al. DNA methylation in health and disease. Nature Reviews Genetics, 1:11-19, 2000). Methylation analysis therefore has the same potential applications as mRNA expression analysis or proteomics. In particular DNA methylation appears to play a key role in imprinting associated disease and cancer (see for example, Zeschnigk M, Schmitz B, Dittrich B, Buiting K, Horsthemke B, Doerfler W. “Imprinted segments in the human genome: different DNA methylation patterns in the Prader-Willi/Angelman syndrome region as determined by the genomic sequencing method” Hum Mol Genet 1997 March; 6(3):387-95 and Peter A. Jones “Cancer. Death and methylation”. Nature. 2001 Jan. 11; 409(6817):141, 143-4). The link between cytosine methylation and cancer has already been established and it appears that cytosine methylation has the potential to be a significant and useful clinical diagnostic marker.

The application of molecular biological techniques in the field of methylation analysis has hitherto been limited to research applications; to date methylation is not a commercially utilised clinical marker. The application of methylation disease markers to a large scale analysis format suitable for clinical, diagnostic and research purposes requires the implementation and adaptation of high throughput techniques in the field of molecular biology to the constraints and demands specific to methylation analysis. Preferred techniques for such analyses include the analysis of bisulfite treated sample DNA by means of microarray technologies, and real time PCR based methods such as MethyLight and HeavyMethyl.

DISCLOSURE OF INVENTION

BRIEF DESCRIPTION

The described invention provides a novel method and computer program products for the process control of assays for the analysis of nucleic acid within DNA samples. The method enables the estimation of the quality of an individual assay based on the distribution of the measurements of variables associated with said assay in comparison to a reference data set. As these measurements are extremely high dimensional and contain outliers, the application of standard MVSPC methods is prohibited. In a particularly preferred embodiment of the method a robust version of principal component analysis is used to detect outliers and reduce data dimensionality. This step enables the improved application of multivariate statistical process control techniques. In a particularly preferred embodiment of the method, the T² control chart is utilised to monitor process relevant parameters. This can be used to improve the assay process itself, limits necessary repetitions to affected samples only and thereby maintains quality in a cost effective way.

DETAILED DESCRIPTION

In the following application the term ‘statistical distance’ is taken to mean a distance between data sets, or between a single measurement vector and a data set, that is calculated with respect to the statistical distribution of one or both data sets. In the following the term ‘robust’ when used to describe a statistic or statistical method is taken to mean a statistic or statistical method that retains its usefulness even when one or more of its assumptions (e.g. normality, lack of gross errors) is violated.

The method and computer program products according to the disclosed invention provide novel means for the verification and controlling of biological assays. Said method and computer program products may be applied to any means of detecting nucleic acid variations wherein a large number of variables are analysed, and/or for controlling experiments wherein a large number of variables influence the quality of the experimental data. Said method is therefore applicable to a large number of commonly used assays for the analysis of nucleic acid variations including, but not limited to, microarray analysis and sequencing, for example in the fields of mRNA expression analysis, single nucleotide polymorphism detection and epigenetic analysis.

To date, the automated analysis of nucleic acid variations has been limited by experiment to experiment variation. Errors or fluctuations in process variables of the environment within which the assays are carried out can lead to decreased quality of assays, which may ultimately lead to false interpretations of the experimental results. Furthermore, certain constraints of assay design, most notably nucleic acid sequence (which affects factors such as cross hybridisation, background and noise in microarray analysis), may be subject to experiment to experiment variation, further complicating standard means of assay result analysis and data interpretation.

One of the factors that complicates the controlling of such high throughput assays within predetermined parameters is the high dimensionality of the data sets which are required to be monitored. Therefore, multiple repetitions of each assay are often carried out in order to minimize the effects of process artefacts in the interpretation of complex nucleic acid assays. There is therefore a pronounced need in the art for improved methods of ensuring the quality of high throughput genomic assays.

In one embodiment, the method and computer program products according to the invention provide a means for the improved detection of assay results which are unsuitable for data interpretation. The disclosed method provides a means of identifying said unsuitable experiments, or batches of experiments, said identified experiments thereupon being excluded from subsequent data analysis. In an alternative embodiment said identified experiments may be further analysed to identify specific operating parameters of the process used to carry out the assay. Said parameters may then be monitored to bring the quality of subsequent experiments within predetermined quality limits. The method and computer program products according to the invention thereby decrease the requirement for repetition of assays as a standard means of quality control. The method according to the invention further provides a means of increasing the accuracy of data interpretation by identifying experiments unsuitable for data analysis.

In the following it is particularly preferred that all herein described elements of the method are implemented by means of a computer.

The aim of the invention is achieved by means of a method of verifying and controlling nucleic acid analysis assays using statistical process control and/or computer program products used for said purpose. The statistical process control may be either multivariate statistical process control or univariate statistical process control. The suitability of each method will be apparent to one skilled in the art. The method according to the invention is characterized in that variables of each experiment are monitored, for each experiment the statistical distance of said variables from a reference data set (also herein referred to as a historical data set) is calculated, and wherein a deviation is beyond a pre-determined limit said experiment is indicated as unsuitable for further interpretation. It is particularly preferred that the method according to the invention is implemented by means of a computer.

In a preferred embodiment this method is used for the controlling and verification of assays used for the determination of cytosine methylation patterns within nucleic acids. In a particularly preferred embodiment the method is applied to those assays suitable for a high throughput format, for example but not limited to, sequencing and microarray analysis of bisulphite treated nucleic acids.

In one embodiment, the method according to the invention comprises four steps. In the first step a reference data set (also herein referred to as a historical data set) is defined, said data set consisting of all the variables that are to be monitored and controlled. In the second step a test data set is defined. Said test data set consists of the experiment or experiments that are to be controlled, and wherein each experiment is defined according to the values of the variables to be analysed.

In the third step of the method the statistical distance between the reference and test data sets, or elements or subsets thereof, is determined. In the fourth step of the method individual elements or subsets of the test data set which have a statistical distance larger than a predetermined value are identified.
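The four steps can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`diagonal_distance`, `flag_unsuitable`) are hypothetical, and the distance used here is a simplified sum of squared z-scores that assumes uncorrelated variables, chosen only to keep the example short.

```python
from statistics import mean, stdev


def diagonal_distance(x, reference):
    """Sum of squared z-scores of x against the reference columns.

    Simplifying assumption for illustration: variables are treated as
    uncorrelated, so no covariance matrix is needed.
    """
    columns = list(zip(*reference))
    return sum(((xi - mean(col)) / stdev(col)) ** 2
               for xi, col in zip(x, columns))


def flag_unsuitable(reference, test, limit, distance=diagonal_distance):
    """Steps 1 and 2: reference and test are lists of measurement vectors.
    Step 3: compute the distance of every test experiment to the reference.
    Step 4: return the indices whose distance exceeds the limit."""
    distances = [distance(x, reference) for x in test]
    flagged = [i for i, d in enumerate(distances) if d > limit]
    return flagged, distances


reference = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
test = [[2.0, 12.0], [6.0, 12.0]]
flagged, distances = flag_unsuitable(reference, test, limit=9.0)
```

Any of the statistical distances named in the description could be passed in place of the default through the `distance` parameter.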

In a particularly preferred embodiment of the method, subsequent to the definition of the reference and test data sets the method comprises a further step, hereinafter referred to as step 2ii). Said step comprises reducing the data dimensionality of the reference and test data set by means of robust embedding of the values into a lower dimensional representation. The embedding space may be calculated by using one or both of the reference and the test data set. It is particularly preferred that the data dimensionality reduction is carried out by means of principal component analysis. In one embodiment of the method step 2ii) comprises the following steps. In the first step the data set is projected by means of robust principal component analysis. In the second step outliers are removed from the data set according to their statistical distances calculated by means of one or more methods taken from the group consisting of: Hotelling's T² distance; percentiles of the empirical distribution of the reference data set; percentiles of a kernel density estimate of the distribution of the reference data set; and distance from the hyperplane of a nu-SVM (see Schölkopf, Bernhard and Smola, Alex J. and Williamson, Robert C. and Bartlett, Peter L., New Support Vector Algorithms. Neural Computation, Vol. 12, 2000.), estimating the support of the distribution of the reference data set. In the third step the embedding projection is calculated by means of standard principal component analysis and the cleared or the complete data set is projected onto this basis vector system.

In one embodiment of the method at least one of the variables measured in steps a) and b) is determined according to the methylation state of the nucleic acids.

In a further preferred embodiment of the method at least one of the variables measured in the first and second steps is determined by the environment used to conduct the assay. Wherein the assay is a microarray analysis it is further preferred that these variables are independent of the arrangement of the oligonucleotides on the array. In a particularly preferred embodiment said variables are selected from the group comprising mean background/baseline values; scatter of the background/baseline values; scatter of the foreground values; geometrical properties of the array; percentiles of background values of each spot; and positive and negative assay control measures.


In a particularly preferred embodiment wherein the assay is a microarray based assay said variables are selected from the group comprising mean background/baseline intensity values; scatter of the background/baseline intensity values; coefficient of variation for background spot intensities; statistical characterisation of the distribution of the background/baseline intensity values (1%, 5%, 10%, 25%, 50%, 75%, 90%, 95%, 99% percentiles, skewness, kurtosis); scatter of the foreground intensity values; coefficient of variation for foreground spot intensities; statistical characterisation of the distribution of the foreground intensity values (1%, 5%, 10%, 25%, 50%, 75%, 90%, 95%, 99% percentiles, skewness, kurtosis); saturation of the foreground intensity values; ratio of mean to median foreground intensity values; geometrical properties of the array, such as the gradient of background intensity values calculated across a set of consecutive rows or columns along a given direction; mean spot diameter values; scatter of spot diameter values; percentiles of spot diameter value distribution across the microarray; and positive and negative assay control measures.

When selecting appropriate variables for the analysis an important criterion is that the statistical distribution of these variables does not change significantly between different series of experiments (wherein each series of experiments is defined as a large series of measurements carried out within one time period and with the same assay design). This allows the utilisation of measurements from previous studies as reference data sets.

Wherein the assay is a microarray based assay it is preferred that the variables to be analysed include at least one variable that refers to each of the foreground, background, geometrical properties and saturation of the microarray. A particularly preferred set of variables is as follows:

-   Background
    -   1. 75% quantile of all observed values of the percentage of background pixels per spot above the mean signal + one standard deviation
    -   2. 75% quantile of all observed values of the percentage of background pixels per spot above the mean signal + two standard deviations
    -   3. skewness of the distribution of observed values of the median background intensity per spot
    -   4. mean value of the ratio of observed values: mean background intensity divided by median background intensity per spot
-   Geometry
    -   1. 75% quantile of all observed values of the difference between the averaged background intensities of four consecutive rows and those of the following four consecutive rows
    -   2. same as in 1. for columns
-   Spot Characteristic
    -   1. 95% quantile of all observed spot diameters
    -   2. median (50% quantile) of all observed spot diameters
    -   3. 75% quantile of the ratio of observed values defined by: standard deviation of foreground intensity per spot divided by mean of foreground intensity per spot
    -   4. median of the ratio of all observed values defined by: mean foreground intensity per spot divided by median foreground intensity per spot
-   Saturation
    -   1. 95% quantile of foreground intensity pixel saturation percentage per spot values
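A few of the listed variables can be computed from per-spot measurements as in the following Python sketch. The function and key names are hypothetical, and the quantile interpolation rule and the moment-based skewness formula are assumptions, since the description does not fix a particular convention.

```python
from statistics import mean, median, quantiles


def quantile(values, percent):
    """percent-th quantile (integer 1..99) with linear interpolation.
    The 'inclusive' rule is an assumption; other conventions exist."""
    return quantiles(values, n=100, method="inclusive")[percent - 1]


def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (one of several conventions)."""
    mu = mean(values)
    m2 = mean([(v - mu) ** 2 for v in values])
    m3 = mean([(v - mu) ** 3 for v in values])
    return m3 / m2 ** 1.5


def array_features(spot_diameters, median_bg, mean_bg):
    """Summarise one array by four of the variables listed above."""
    return {
        # Spot Characteristic 1 and 2
        "diameter_q95": quantile(spot_diameters, 95),
        "diameter_median": median(spot_diameters),
        # Background 3: skewness of median background intensity per spot
        "bg_median_skewness": skewness(median_bg),
        # Background 4: mean of (mean background / median background)
        "bg_mean_to_median": mean([a / b for a, b in zip(mean_bg, median_bg)]),
    }


features = array_features(
    spot_diameters=[1, 2, 3, 4, 5],
    median_bg=[10, 20, 30],
    mean_bg=[11, 22, 33],
)
```

Each array then contributes one such feature vector to the reference or test data set.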

For each variable or group thereof the further steps of the method are according to the described method. Therefore, in one embodiment of the method, the statistical distance of each variable from the reference data set is first calculated. It is preferred that the reference data set is composed of a large set of previous measurements that is obtained under similar experimental conditions. The variables within each category are then combined, either by embedding into a 1-dimensional space or by averaging single values.

Preferably, both the statistical distance and the embedding are carried out in a robust way.
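One possible robust variant of the per-variable distance, combined within a category by averaging, is sketched below in Python. The choice of median and Median Absolute Deviation is one option among the robust estimators named in this description; the function names and the consistency factor 1.4826 (which makes the MAD comparable to a standard deviation under normality) are assumptions of this sketch.

```python
from statistics import mean, median

MAD_TO_SIGMA = 1.4826  # standard consistency factor for normal data


def mad(values):
    """Median Absolute Deviation around the median."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def robust_distance(x, reference_values):
    """Absolute deviation of x from the reference median in robust sigma units."""
    scale = MAD_TO_SIGMA * mad(reference_values)
    if scale == 0:
        raise ValueError("reference values have zero MAD")
    return abs(x - median(reference_values)) / scale


def category_score(test_values, reference_columns):
    """Combine the variables of one category by averaging their distances."""
    return mean([robust_distance(x, ref)
                 for x, ref in zip(test_values, reference_columns)])


# Two hypothetical variables of one category; the value 100 is a gross
# outlier in the reference data and barely affects median or MAD.
reference_columns = [[1, 2, 3, 4, 100], [10, 20, 30, 40, 50]]
score = category_score([5, 30], reference_columns)
```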

In a further preferred embodiment, to calculate the quality of the experiment, a lower dimensional embedding of both the reference and the test data set is first calculated. It is preferred that the reference data set that is used is composed of a large set of previous measurements that are obtained under similar experimental conditions. Secondly, the statistical distance is calculated in this reduced dimensional space. This statistical distance is used as the quality score.

It will be obvious to one skilled in the art that it is not necessary that the second step of the method is temporally subsequent to the first step of the method. The reference data set may be defined subsequent to the test data set; alternatively it may be defined concurrently with the test data set. In one embodiment of the method the reference data set may consist of all experiments run in a series wherein said series is user defined. To give one example, where a microarray assay is applied to a series of tissue samples the measured variables of all the samples may be included in said reference data set, however analyses of the same tissue set using an alternative array may not. Accordingly the test data set may be a subset of or identical to the reference data set. In another embodiment of the method the reference data set consists of experiments that were carried out independent or separate from those of the test data set. The two data sets may be differentiated by factors such as but not limited to time of production, operator (human or machine), and environment used to carry out the experiment (for example, but not limited to, temperature, reagents used and concentrations thereof, temporal factors and nucleic acid sequence variations).

In a further embodiment of the method the reference data set is derived from a set of experiments wherein the value of each analysed variable of each experiment is either within predetermined limits or, alternatively, said variables are controlled in an optimal manner.

In step 4 of the method the statistical distance may be calculated by means of one or more methods taken from the group consisting of the Hotelling's T² distance between a single test measurement vector and the reference data set, the Hotelling's T² distance between a subset of the test data set and the reference data set, the distance between the covariance matrices of a subset of the test data set and the covariance matrix of the reference set, percentiles of the empirical distribution of the reference data set, percentiles of a kernel density estimate of the distribution of the reference data set, and distance from the hyperplane of a nu-SVM (see Schölkopf, Bernhard and Smola, Alex J. and Williamson, Robert C. and Bartlett, Peter L., New Support Vector Algorithms. Neural Computation, Vol. 12, 2000.), estimating the support of the distribution of the reference data set. Wherein Hotelling's T² distance between a single test measurement vector and the reference data set is measured, it is preferred that the T² distance is calculated by using the sample estimate for mean and variance or any robust estimate for location, including trimmed mean, median, Tukey's biweight, L1-median, Oja-median, minimum volume ellipsoid estimator and S-estimator (see Hendrik P. Lopuhaa and Peter J. Rousseeuw: Breakdown points of affine equivariant estimators of multivariate location and covariance matrices), and any robust estimate for scale including Median Absolute Deviation, interquantile range, Qn-estimator, minimum volume ellipsoid estimator and S-estimator.

In a particularly preferred embodiment this is defined as:

$T^{2}(i) = \left( m_{i} - \mu \right)^{\prime} S^{-1} \left( m_{i} - \mu \right)$

wherein the reference set mean is

$\mu = \frac{1}{N_{C}} \sum\limits_{i = 1}^{N_{C}} m_{i}$

and the reference set sample covariance matrix is

$S = \frac{1}{N_{C} - 1} \sum\limits_{i = 1}^{N_{C}} \left( m_{i} - \mu \right) \left( m_{i} - \mu \right)^{\prime}$

wherein $N_{C}$ is the number of experiments in the reference set and $m_{i}$ is the ith measurement vector of the reference or test data set.
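The T² definition above, with the sample mean and sample covariance of the reference set, can be evaluated as in the following Python sketch. It uses only the standard library; the helper names are hypothetical, and solving the linear system by Gaussian elimination instead of forming the inverse of S explicitly is a design choice of this sketch.

```python
def mean_vector(rows):
    """Column-wise mean of a list of measurement vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sample_covariance(rows):
    """Sample covariance matrix S with the 1 / (N - 1) normalisation."""
    n, mu = len(rows), mean_vector(rows)
    p = len(mu)
    return [[sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def hotelling_t2(m, reference):
    """T^2 = (m - mu)' S^-1 (m - mu) for one measurement vector m."""
    mu = mean_vector(reference)
    d = [mi - ui for mi, ui in zip(m, mu)]
    return sum(di * yi for di, yi in zip(d, solve(sample_covariance(reference), d)))


reference = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
t2_far = hotelling_t2([3.0, 1.0], reference)
t2_centre = hotelling_t2([1.0, 1.0], reference)
```

With high dimensional data and few reference experiments S becomes singular, which is one reason the description places a dimensionality reduction step before this calculation.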

Wherein the Hotelling's T² distance is calculated between a subset of the test data set and the reference data set, it is preferred that the T² is calculated by using the sample estimate for mean and variance or any robust estimate for location, including trimmed mean, median, Tukey's biweight, L1-median, Oja-median, and any robust estimate for scale including Median Absolute Deviation, interquantile range, Qn-estimator, minimum volume ellipsoid estimator and S-estimator. In a particularly preferred embodiment this is defined as:

$T_{w}^{2}(i) = \left( \mu_{HDS} - \mu_{CDS} \right)^{T} \bar{S}^{-1} \left( \mu_{HDS} - \mu_{CDS} \right)$

Wherein ‘HDS’ refers to the historical data set, also referred to herein as the reference data set, and ‘CDS’ refers to the current data set, also referred to herein as the test data set. Furthermore, $\bar{S}$ is calculated from the sample covariance matrices $S_{HDS}$ and $S_{CDS}$:

$\bar{S} = \frac{\left( N_{HDS} - 1 \right) S_{HDS} + \left( N_{CDS} - 1 \right) S_{CDS}}{N_{HDS} + N_{CDS} - 2}.$

Wherein the statistical distance is calculated as the distance between the covariance matrices of a subset of the test data set and the covariance matrix of the reference set, it is preferred that the test statistics of the likelihood ratio test for different covariance matrices are included. See for example Hartung J. and Elpelt B: Multivariate Statistik. R. Oldenbourg, München, Wien, 1995. In a particularly preferred embodiment this is defined as:

$L(i) = 2\left\lbrack \ln\left| \bar{S} \right| - \frac{N_{HDS} - 1}{N_{HDS} + N_{CDS} - 2} \ln\left| S_{HDS} \right| - \frac{N_{CDS} - 1}{N_{HDS} + N_{CDS} - 2} \ln\left| S_{CDS} \right| \right\rbrack$

wherein $\left| \cdot \right|$ denotes the determinant.
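The covariance-based distance can be sketched in Python as follows. The code assumes that the logarithms in the formula are taken of matrix determinants, as is usual for this likelihood ratio statistic, and that the pooled matrix is the weighted average of the two sample covariance matrices with weights (N − 1)/(N_HDS + N_CDS − 2); all function names are hypothetical.

```python
import math


def sample_covariance(rows):
    """Sample covariance matrix with the 1 / (N - 1) normalisation."""
    n = len(rows)
    mu = [sum(col) / n for col in zip(*rows)]
    p = len(mu)
    return [[sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def determinant(A):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in A]
    n, det = len(M), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if M[piv][c] == 0.0:
            return 0.0
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            det = -det  # a row swap flips the sign
        det *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n):
                M[r][k] -= f * M[c][k]
    return det


def covariance_distance(hds, cds):
    """L = 2 [ln|S_pooled| - w_H ln|S_HDS| - w_C ln|S_CDS|]."""
    n_h, n_c = len(hds), len(cds)
    s_h, s_c = sample_covariance(hds), sample_covariance(cds)
    w_h = (n_h - 1) / (n_h + n_c - 2)
    w_c = (n_c - 1) / (n_h + n_c - 2)
    pooled = [[w_h * a + w_c * b for a, b in zip(ra, rb)]
              for ra, rb in zip(s_h, s_c)]
    return 2 * (math.log(determinant(pooled))
                - w_h * math.log(determinant(s_h))
                - w_c * math.log(determinant(s_c)))


hds = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
shifted = [[5.0, 5.0], [7.0, 5.0], [5.0, 7.0], [7.0, 7.0]]  # same spread
wider = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]]    # doubled spread
l_same = covariance_distance(hds, shifted)
l_wider = covariance_distance(hds, wider)
```

The statistic is zero when both data sets have the same covariance matrix, regardless of a shift in their means, and grows as the covariance structures diverge.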

In a further embodiment of the method, subsequent to steps 1 to 4, the method may further comprise a fifth step. In a first embodiment of the method said identified experiments or batches thereof are further interrogated to identify specific operating parameters of the process used to carry out the assay that may be required to be monitored to bring the quality of the assays within predetermined quality limits. In one embodiment of the method this is enabled by means of verifying the influence of each individual variable by computing its univariate T² distances between reference and test data set. In a further embodiment one may analyse the orthogonalized T² distance by computing the PCA embedding of step 2ii) based on the reference data set. The principal component responsible for the largest part of the T² distance of an out of control test data point may then be identified. Responsible individual variables can be identified by their weights in this principal component. In a further embodiment variables responsible for the out of control situation can be identified by backward selection. A subset of variables or single variables can be excluded from the statistical distance calculation and one can observe whether the computed distance gets significantly smaller. Wherein the computed statistical distance significantly decreases one can conclude that the excluded variables were at least partially responsible for the observed out of control situation.
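Backward selection as described above can be sketched in Python by excluding one variable at a time and recording how much the distance drops. For brevity the sketch uses a sum of squared z-scores as the statistical distance, which assumes uncorrelated variables; the function names are hypothetical and any other distance from this description could be substituted.

```python
from statistics import mean, stdev


def distance(x, reference, keep):
    """Sum of squared z-scores over the variable indices in keep
    (simplified stand-in for a full multivariate distance)."""
    columns = list(zip(*reference))
    return sum(((x[j] - mean(columns[j])) / stdev(columns[j])) ** 2
               for j in keep)


def backward_selection(x, reference):
    """Return the full distance and, per variable, the decrease in
    distance observed when that variable is excluded."""
    p = len(x)
    full = distance(x, reference, range(p))
    decrease = {}
    for j in range(p):
        keep = [k for k in range(p) if k != j]
        decrease[j] = full - distance(x, reference, keep)
    return full, decrease


reference = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
out_of_control = [6.0, 12.0]
full, decrease = backward_selection(out_of_control, reference)
suspect = max(decrease, key=decrease.get)  # variable with the largest drop
```

In this toy example excluding variable 0 removes the whole distance, so it would be the first candidate to investigate.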

In a further embodiment, said identified assays are designated as unsuitable for data interpretation, the experiment(s) are excluded from data interpretation, and are preferably repeated until identified as having a statistical distance within the predetermined limit.

In a particularly preferred embodiment, the method further comprises the generation of a document comprising said elements or subsets of the test data determined to be outliers. In a further embodiment said document further comprises the contribution of individual variables to the determined statistical distance. It is preferred that said document be generated in a manner readable either by the user of the computer program or by means of a computer, and wherein said computer readable document further comprises a graphical user interface.

Said document may be generated by any means standard in the art, however, it is particularly preferred that the document is automatically generated by computer implemented means, and that the document is accessible in a computer readable format (e.g. HTML, portable document format (pdf), postscript (ps)) and variants thereof. It is further preferred that the document be made available on a server enabling simultaneous access by multiple individuals.

In another aspect of the invention computer program products are provided. An exemplary computer program product comprises:

-   a) a computer code that receives as input a reference data set
-   b) a computer code that receives as input a test data set
-   c) a computer code that determines the statistical distance between the reference data set and test data set or elements or subsets thereof
-   d) a computer code that identifies individual elements or subsets of the test data set which have a statistical distance larger than that of a predetermined value
-   e) a computer readable medium that stores the computer code.

It is further preferred that said computer program product comprises a computer code for the reduction of the data dimensionality of the reference and test data set by means of robust embedding of the values into a lower dimensional representation.

In a preferred embodiment the computer program product further comprises a computer code that reduces the data dimensionality of the reference and test data set by means of robust embedding of the values into a lower dimensional representation. In this embodiment of the invention the embedding space may be calculated using one or both of the reference and the test data sets. In one particularly preferred embodiment the computer code carries out the data dimensionality reduction step by means of a method comprising the following steps:

-   i) Projecting the data set by means of robust principal component analysis
-   ii) Removing outliers from the data set according to their statistical distances calculated by means of one or more methods taken from the group consisting of: Hotelling's T² distance; percentiles of the empirical distribution of the reference data set; percentiles of a kernel density estimate of the distribution of the reference data set; and distance from the hyperplane of a nu-SVM, estimating the support of the distribution of the reference data set
-   iii) Calculating the embedding projection by standard principal component analysis and projecting the cleared or the complete data set onto this basis vector system.

In a further preferred embodiment the computer program product further comprises a computer code that generates a document comprising said elements or subsets of the test data identified by the computer code of step d). It is preferred that said document be generated in a form readable either by the user of the computer program or by a computer, and that said computer readable document further comprises a graphical user interface.

EXAMPLES

Example 1

In this example the method according to the invention is used to control the analysis of methylation patterns by means of nucleic acid microarrays.

In order to measure the methylation state of different CpG dinucleotides by hybridization, sample DNA is bisulphite treated to convert all unmethylated cytosines to uracil; this treatment is not effective upon methylated cytosines and they are consequently conserved. Genes are then amplified by PCR using fluorescently labelled primers. In the amplificate nucleic acids, unmethylated CpG dinucleotides are represented as TG dinucleotides and methylated CpG sites are conserved as CG dinucleotides. Pairs of PCR primers are multiplexed and designed to hybridise to DNA segments containing no CpG dinucleotides. This allows unbiased amplification of multiple alleles in a single reaction. All PCR products from each individual sample are then mixed and hybridized to glass slides carrying a pair of immobilised oligonucleotides for each CpG position to be analysed. Each of these detection oligonucleotides is designed to hybridize to the bisulphite converted sequence around a specific CpG site which is either originally unmethylated (TG) or methylated (CG). Hybridization conditions are selected to allow the detection of the single nucleotide differences between the TG and CG variants.

In the following, $N_{CpG}$ is the number of measured CpG positions per slide, $N_S$ is the number of biological samples in the study and $N_C$ is the number of hybridized chips in the study. For a specific CpG position $k \in \{1, \ldots, N_{CpG}\}$, the frequency of methylated alleles in sample $j \in \{1, \ldots, N_S\}$, hybridized onto chip $i \in \{1, \ldots, N_C\}$, can then be quantified as Equation 1:

$$m_{ik} = \log\frac{CG_{ik}}{TG_{ik}},$$

where $CG_{ik}$ and $TG_{ik}$ are the corresponding hybridization intensities. This ratio is invariant to the overall intensity of the particular hybridization experiment and therefore gives a natural normalization of our data.
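
The invariance claimed for Equation 1 can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `methylation_log_ratio` and the intensity values are hypothetical and chosen only for illustration.

```python
import numpy as np

def methylation_log_ratio(cg, tg):
    """Equation 1: m_ik = log(CG_ik / TG_ik), computed element-wise."""
    cg = np.asarray(cg, dtype=float)
    tg = np.asarray(tg, dtype=float)
    return np.log(cg / tg)

# Hypothetical intensities for one chip with three CpG positions.
cg = np.array([1200.0, 300.0, 800.0])
tg = np.array([400.0, 900.0, 800.0])
m = methylation_log_ratio(cg, tg)

# Scaling both channels by the same factor (a brighter or dimmer
# hybridization) leaves every log ratio unchanged.
m_scaled = methylation_log_ratio(2.5 * cg, 2.5 * tg)
```

Equal CG and TG signals give a log ratio of 0, which is why saturated or empty spot pairs cluster around 0 in the artefact plots discussed later.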

Here we will refer to a single hybridization experiment $i$ as experiment or chip. The resulting set of measurement values is the methylation profile $m_i = (m_{i1}, \ldots, m_{iN_{CpG}})'$. We usually have several repeated hybridization experiments $i$ for every single sample $j$. The methylation profile for a sample $j$ is estimated from its set of repetitions $R_j$ by the L₁-median as $m_j = \arg\min_x \sum_{i \in R_j} \|m_i - x\|_2$. In contrast to the simple component-wise median this gives a robust estimate of the methylation profile that is invariant to orthogonal linear transformations such as PCA.
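
The L₁-median has no closed form. The sketch below approximates it with the Weiszfeld iteration, which is an assumption of this illustration (the text does not name a solver); the function name `l1_median` and the toy repetitions are likewise hypothetical.

```python
import numpy as np

def l1_median(points, n_iter=200, tol=1e-9):
    """Weiszfeld iteration for argmin_x sum_i ||p_i - x||_2."""
    points = np.asarray(points, dtype=float)
    x = points.mean(axis=0)                      # starting value
    for _ in range(n_iter):
        dist = np.linalg.norm(points - x, axis=1)
        dist = np.maximum(dist, 1e-12)           # guard against division by zero
        w = 1.0 / dist
        x_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Five hypothetical repetitions of one sample; the last one is an outlier chip.
reps = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [10.0, 10.0]])
profile = l1_median(reps)

# Rotating the data rotates the estimate in the same way.
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
profile_rot = l1_median(reps @ Q.T)
```

In this toy case the arithmetic mean is pulled to about (2, 2) by the single outlier, while the L₁-median stays inside the cluster of regular repetitions.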

Data Sets

In our analysis we used data from three microarray studies. In each study the methylation status of about 200 different CpG dinucleotide positions from promoters, intronic and coding sequences of 64 genes was measured.

Temperature Control: Our first set of 207 chips came from a control experiment where PCR amplificates of DNA from the peripheral blood of 15 patients diagnosed with ALL or AML were hybridized at 4 different temperatures (38 °C, 42 °C, 44 °C, 46 °C). We will use this data set to show that our method can reliably detect shifts in experimental conditions.

Lymphoma: The second data set with an overall number of 647 chips came from a study where the methylation status of different subtypes of non-Hodgkin lymphomas from 68 patients was analyzed. All chips underwent a visual quality control, resulting in quality classification as “good” (proper spots and low background), “acceptable” (no obvious defects but uneven spots, high background or weak hybridization signals) and “unacceptable” (obvious defects). We will use this data set to identify different types of outliers and show how our methods detect them.

In addition we simulated an accidental exchange of oligo probes during slide fabrication in order to demonstrate that such an effect can be detected by our method. The exchange was simulated in silico by permuting 12 randomly selected CpG positions on 200 of the chips (corresponding to an accidental rotation of a 24 well oligo supply plate during preparation for spotting).

ALL/AML: Finally we show data from a second study on ALL and AML, containing 468 chips from 74 different patients. During the course of this study 46 oligomers had to be re-synthesized, some of which showed a significant change in hybridization behavior due to synthesis quality problems. We will demonstrate how our algorithm successfully detected this systematic change in experimental conditions.

Typical Artefacts

Typical artefacts in microarray based methylation analysis are shown in FIG. 1. The plots show the correlation between single or averaged methylation profiles. Every point corresponds to a single CpG position; the axis values are log ratios. a) A normal chip, showing good correlation to the sample average. b) A chip classified as “unacceptable” by visual inspection. Many spots showed no signal, resulting in a log ratio of 0. c) A chip classified as “good”. Hybridization conditions were not stringent enough, resulting in saturation. In many cases pairs of CG and TG oligos showed nearly identical high signals, giving a log ratio around 0. d) A chip classified as “acceptable”. Hybridization signals were weak compared to the background intensity, resulting in a high amount of noise. e) Comparison of group averages over all 64 ALL/AML chips hybridized at 42 °C and all 48 ALL/AML chips hybridized at 44 °C. f) Comparison of group averages over 447 regular chips from the lymphoma data set and the 200 chips with a simulated accidental probe exchange during slide production, affecting 12 CpG positions.

With a high number of replications for each biological sample and the corresponding average m being reliably estimated, outlier chips can be relatively easily detected by their strong deviation from the robust sample average. In the following, we will discuss some typical outlier situations, using data from the Lymphoma experiment. In this case the hybridization of each sample was repeated at a very high redundancy of 9 chips.

After identifying possible error sources the question remains how to reliably detect them, in particular if they cannot be avoided with absolute certainty. One aim of the invention is therefore to exclude single outlier chips from the analysis and to detect systematic changes in experimental conditions as early as possible in order to facilitate a fast recalibration of the production process.

Detecting Outlier Chips with Robust PCA

Methods

As a first step we want to detect single outlier chips. In contrast to standard statistical approaches based on image features of single slides we will use the overall distribution of the whole experimental series. This is motivated by the fact that although image analysis algorithms will successfully detect bad hybridization signals, they will usually fail in cases of unspecific hybridization. The aim is to identify the region in measurement space where most of the chips $m_i$, $i = 1, \ldots, N_C$, are located. The region will be defined by its center and an upper limit for the distance between a single chip and the region center. Chips with deviations higher than the upper limit will be regarded as outliers.

A simple approach is to independently define for every CpG position $k$ the deviation from the center $\mu_k$ as

$$t_k = \frac{|m_{ik} - \mu_k|}{s_k},$$

hereinafter referred to as Equation 3, where $\mu_k = (1/N)\sum_i m_{ik}$ is the mean and $s_k^2 = 1/(N-1)\sum_i (m_{ik} - \mu_k)^2$ is the sample variance over all chips. Assuming that the $m_{ik}$ are normally distributed, $t_k$ multiplied by a constant follows a t-distribution with $N-1$ degrees of freedom. This can be used to define the upper limit of the admissible region for a given significance level α.
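
As an illustration of Equation 3, the sketch below computes the component-wise deviation for a chips-by-CpG matrix. The function name `componentwise_deviation`, the synthetic data and the planted defect are assumptions made only for this example.

```python
import numpy as np

def componentwise_deviation(M):
    """Equation 3 for every chip i and CpG position k: |m_ik - mu_k| / s_k."""
    mu = M.mean(axis=0)
    s = M.std(axis=0, ddof=1)        # sample standard deviation, 1/(N-1)
    return np.abs(M - mu) / s

# Synthetic log ratios: 50 chips, 4 CpG positions, one planted spot defect.
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 4))
M[7, 2] = 10.0
t = componentwise_deviation(M)
worst_chip, worst_cpg = np.unravel_index(t.argmax(), t.shape)
```

Each column is judged on its own here, which is exactly the limitation the following paragraph addresses.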

However, a separate treatment of the different CpG positions is only optimal when their measurement values are independent. As FIG. 2 demonstrates, it is important to take into account the correlation between different dimensions. It is possible that a point which is not detected as an outlier by a component-wise test is in reality an outlier (e.g. P₁ in FIG. 2). On the other hand, there are points that will be erroneously detected as outliers by a component-wise test (e.g. P₂ in FIG. 2). Because microarray data usually have a very high correlation, it is better to use a multivariate distance concept instead of the simple univariate $t_k$-distance. A natural generalization of the $t_k$-distance is given by Hotelling's T² statistic, defined as Equation 4:

$$T^2(i) = (m_i - \mu)' S^{-1} (m_i - \mu),$$

with mean $\mu = (1/N_C)\sum_{i=1}^{N_C} m_i$ and sample covariance matrix

$$S = \frac{1}{N_C - 1}\sum_{i=1}^{N_C} (m_i - \mu)(m_i - \mu)'.$$
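
The difference between the component-wise test and Equation 4 can be reproduced on synthetic data. This is a minimal sketch assuming NumPy; `hotelling_t2` and the two-dimensional toy data (with a point placed like P₁, inside both univariate ranges but off the correlation axis) are hypothetical.

```python
import numpy as np

def hotelling_t2(M):
    """Equation 4 for every row m_i of M, using the sample mean and covariance."""
    mu = M.mean(axis=0)
    S = np.cov(M, rowvar=False)                  # 1/(N_C - 1) normalisation
    diff = M - mu
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S), diff)

# Two strongly correlated measurements plus one point that breaks the correlation.
rng = np.random.default_rng(1)
z = rng.normal(size=200)
M = np.column_stack([z, z + 0.1 * rng.normal(size=200)])
M = np.vstack([M, [1.5, -1.5]])                  # P1-like point, index 200

t2 = hotelling_t2(M)
t_univariate = np.abs(M - M.mean(axis=0)) / M.std(axis=0, ddof=1)
```

The appended point deviates by roughly 1.5 standard deviations in each coordinate, so a component-wise test would pass it, yet it has by far the largest T² in the set.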

Assuming that the $m_i$ are multivariate normally distributed, T² multiplied by a constant follows an F-distribution with $N_{CpG}$ and $N_C - N_{CpG}$ degrees of freedom. This can be used to define the upper limit of the admissible region for a given significance level α.

Two problems arise when we want to use the T²-distance for microarray data:

-   1. For fewer chips $N_C$ than measurements $N_{CpG}$, the sample covariance matrix S is singular and not invertible.
-   2. The estimates for μ and S are not robust against outliers.

The first problem can be addressed by using principal component analysis (PCA) to reduce the dimensionality of our measurement space. This is done by projecting all methylation profiles $m_i$ onto the first $d$ eigenvectors with the highest variance. As a result we get the $d$-dimensional centered vectors $\tilde{m}_i = P_{PCA}(m_i - \mu)$ in eigenvector space. After the projection, the covariance matrix $\tilde{S} = \mathrm{diag}(\tilde{s}_1^2, \ldots, \tilde{s}_d^2)$ of the reduced space is a diagonal matrix and the T²-distance of Equation 4 is approximated by the T²-distance in the reduced space

$$\tilde{T}^2(i) = \sum_{r=1}^{d} \frac{\tilde{m}_{ir}^2}{\tilde{s}_r^2}.$$
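
The reduced-space distance can be sketched with a singular value decomposition, which yields the eigenvectors of the sample covariance without forming or inverting it. The function name `reduced_t2` and the synthetic dimensions are assumptions of this example.

```python
import numpy as np

def reduced_t2(M, d):
    """T^2 distance after projecting onto the first d principal components."""
    mu = M.mean(axis=0)
    X = M - mu
    # Rows of Vt are the eigenvectors of the sample covariance, ordered by variance.
    _, sv, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:d]                                   # projection matrix (d x N_CpG)
    scores = X @ P.T                             # reduced, centered vectors
    var = sv[:d] ** 2 / (M.shape[0] - 1)         # variances along each component
    return (scores ** 2 / var).sum(axis=1)

# 20 chips but 50 CpG positions: the full covariance matrix is singular.
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 50))
t2 = reduced_t2(M, d=3)
```

With fewer chips than CpG positions Equation 4 cannot be evaluated directly, whereas the reduced distance is always defined as long as the retained variances are non-zero.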

Under the assumption that the true variances are equal to $\tilde{s}_r^2$, $\tilde{T}^2$ follows a χ² distribution with $d$ degrees of freedom. This can be used to define the upper limit of the admissible region for a given significance level α. However, the problem remains that the estimated eigenvectors and variances $\tilde{s}_r^2$ are not robust against outliers.

We propose to solve the problem of outlier sensitivity together with the dimension reduction step by using robust principal component analysis (rPCA). rPCA finds the first $d$ directions with the largest scale in data space, robustly approximating the first $d$ eigenvectors. The algorithm starts with centering the data with a robust location estimator. Here we will use the L₁ median according to Equation 6:

$$\mu_{L1} = \arg\min_{x} \sum_{i=1}^{N_C} \|m_i - x\|_2.$$

In contrast to the simple component-wise median, this gives a robust estimate of the distribution center that is invariant to orthogonal linear transformations such as PCA. Then all centered observations are projected onto a finite subset of all possible directions in measurement space. The direction with maximum robust scale (measured e.g. by the $Q_n$ estimator) is chosen as an approximation of the eigenvector with the largest eigenvalue. After projecting the data into the orthogonal subspace of the selected “eigenvector” the procedure searches for an approximation of the next eigenvector. Here the finite set of possible directions is simply chosen as the set of centered observations themselves.
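
The projection-pursuit procedure described above can be sketched as follows. Two simplifications are assumptions of this illustration and not part of the text: the robust scale is the median absolute deviation (MAD) instead of the $Q_n$ estimator, and the L₁ median is approximated by a Weiszfeld iteration. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def l1_median(X, n_iter=200, tol=1e-9):
    """Weiszfeld iteration for the L1 median (Equation 6)."""
    x = X.mean(axis=0)
    for _ in range(n_iter):
        dist = np.maximum(np.linalg.norm(X - x, axis=1), 1e-12)
        x_new = (X / dist[:, None]).sum(axis=0) / (1.0 / dist).sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def mad_scale(z):
    """Robust scale: median absolute deviation (stand-in for Q_n)."""
    return 1.4826 * np.median(np.abs(z - np.median(z)))

def robust_pca(M, d):
    """Return the robust center, d directions (rows) and their robust scales."""
    center = l1_median(M)
    X = M - center
    directions, scales = [], []
    for _ in range(d):
        norms = np.linalg.norm(X, axis=1)
        keep = norms > 1e-10
        candidates = X[keep] / norms[keep][:, None]   # observations as directions
        proj = X @ candidates.T                        # one column per candidate
        s = np.array([mad_scale(col) for col in proj.T])
        best = candidates[s.argmax()]
        directions.append(best)
        scales.append(s.max())
        X = X - np.outer(X @ best, best)               # move to orthogonal subspace
    return center, np.array(directions), np.array(scales)

def robust_t2(M, center, directions, scales):
    """Reduced-space T^2 using the robust directions and scales."""
    scores = (M - center) @ directions.T
    return ((scores / scales) ** 2).sum(axis=1)

# Synthetic chips: 60 regular ones and 5 with a strong shift in one coordinate.
rng = np.random.default_rng(0)
spread = np.array([5.0, 2.0, 1.0, 0.5, 0.5])
clean = rng.normal(size=(60, 5)) * spread
bad = rng.normal(size=(5, 5)) * spread
bad[:, 1] += 40.0
M = np.vstack([clean, bad])

center, directions, scales = robust_pca(M, d=2)
t2 = robust_t2(M, center, directions, scales)
```

Because center and scales are estimated robustly, the shifted chips do not inflate the scale along their own direction and therefore keep a large distance, which is the property classical PCA lacks.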

After obtaining the robust projection of our data into a $d$-dimensional subspace we can compute the upper limit of the admissible region $\tilde{T}^2_{UCL}$, also referred to as the upper control limit (UCL). For a given significance level α it is computed as Equation 7:

$$\tilde{T}^2_{UCL} = \chi^2_{d,\,1-\alpha}.$$

Every observation $m_i$ with $\tilde{T}^2(i) > \tilde{T}^2_{UCL}$ is regarded as an outlier.
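
Equation 7 and the outlier rule can be sketched with SciPy's chi-square quantile function. The values d = 2 and α = 0.05 and the example distances are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.stats import chi2

def t2_upper_control_limit(d, alpha):
    """Equation 7: the (1 - alpha) quantile of the chi-square distribution with d d.o.f."""
    return chi2.ppf(1.0 - alpha, df=d)

ucl = t2_upper_control_limit(d=2, alpha=0.05)

# Hypothetical reduced-space T^2 distances of four chips.
t2 = np.array([0.8, 3.1, 12.4, 5.0])
outliers = np.flatnonzero(t2 > ucl)      # indices of chips above the UCL
```

For d = 2 the limit is about 5.99, so only the third chip of this toy example is flagged.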

Results

In order to test how the rPCA algorithm works on microarray data we applied it to the Lymphoma dataset and compared its performance to classical PCA. The results are shown in FIG. 3.

The rPCA algorithm detected 97% of the chips with “unacceptable” quality, whereas classical PCA only detected 29%. 10% of the “acceptable” chips were detected as outliers by rPCA, whereas PCA detected 3%. rPCA detected 21 chips as outliers which were classified as “good”. These chips have all been confirmed to show saturated hybridization signals, not identified by visual inspection. This means rPCA is able to detect nearly all cases of outlier chips identified by visual inspection. Additionally rPCA detects microarrays which have inconspicuous image quality but show an unusual hybridization pattern.

An obvious concern with this use of rPCA for outlier detection is that it relies on the assumption of normal distribution of the data. If the distribution of the biological data is highly multi-modal, biological subclasses may be wrongly classified as outliers. To quantify this effect we simulated a very strong cluster structure in the Lymphoma data by shifting one of the smaller subclasses by a multiple of the standard deviation. Only when the measurements of all 174 CpG positions of the subclass were shifted by more than 2 standard deviations was a considerable part of the biological samples wrongly classified as outliers. In order to avoid such a misclassification, we tolerate at most 50% of repeated measurements of a single biological sample to be classified as outliers. However, we never reached this threshold in practice.

Statistical Process Control

Methods

In the last section we have seen how outliers can be detected solely on the basis of the overall data distribution. Statistical process control expands this approach by introducing the concept of time. The aim is to observe the variables of a process for some time under perfect working conditions. The data collected during this period form the so-called historical data set (HDS), also referred to above as the ‘reference data set’. Under the assumption that all variables are normally distributed, the mean $\mu_{HDS}$ and the sample covariance matrix $S_{HDS}$ of the historical data set fully describe the statistical behavior of the process.

Given the historical data set it becomes possible to check at any time point $i$ how far the current state of the process has deviated from the perfect state by computing the T²-distance between the ideal process mean $\mu_{HDS}$ and the current observation $m_i$. This corresponds to Equation 4 with the overall sample estimates μ and S replaced by their reference counterparts $\mu_{HDS}$ and $S_{HDS}$. Any change in the process will cause observations with greater T²-distances. To decide whether an observation shows a significant deviation from the HDS we compute the upper control limit as in Equation 8:

$$T^2_{UCL} = \frac{p(n+1)(n-1)}{n(n-p)} F_{p,\,n-p,\,1-\alpha},$$

where $p$ is the number of observed variables, $n$ is the number of observations in the HDS, α is the significance level and $F_{p,\,n-p,\,1-\alpha}$ is the $(1-\alpha)$ quantile of the F-distribution with $p$ and $n-p$ degrees of freedom. Whenever $T^2 > T^2_{UCL}$ is observed the process has to be regarded as significantly out of control.
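
Equation 8 translates directly into code. The sketch below assumes SciPy's `f.ppf` for the F quantile; the function name and the values of p, n and α are hypothetical.

```python
from scipy.stats import f

def t2_ucl_future_observation(p, n, alpha):
    """Equation 8: control limit for a new observation against an HDS of size n."""
    factor = p * (n + 1) * (n - 1) / (n * (n - p))
    return factor * f.ppf(1.0 - alpha, p, n - p)

# Hypothetical setting: 5 retained components, 100 historical chips.
ucl = t2_ucl_future_observation(p=5, n=100, alpha=0.05)

# With a single variable the limit reduces to (1 + 1/n) times the squared
# two-sided t quantile, which gives a quick plausibility check.
ucl_one_variable = t2_ucl_future_observation(p=1, n=30, alpha=0.05)
```

The limit grows as the HDS becomes smaller relative to the number of variables, reflecting the larger uncertainty in the estimated mean and covariance.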

In our case the process to control is a microarray experiment and the only process variables we have observed are the log ratios of the actual hybridization intensities. A single observation is then a chip $m_i$ and the HDS of size $N_{HDS}$ is defined as $\{m_1, \ldots, m_{N_{HDS}}\}$. We have to be aware of a few important issues in this interpretation of statistical process control. First, our data has a multi-modal distribution which results from a mixture of different biological samples and classes. Therefore the assumption of normality is only a rough approximation and $T^2_{UCL}$ from Equation 8 should be regarded with caution. Secondly, as we have seen in the last sections, microarray experiments produce outliers, resulting in transgression of the UCL. This means sporadic violations of the UCL are normal and do not indicate that the process is out of control. The third issue is that we have to use the assumption that a microarray study will not systematically change its data generating distribution over time. Therefore the experimental design has to be randomized or block randomized, otherwise a systematic change in the correctly measured biological data will be interpreted as an out of control situation (e.g. when all patients with the same disease subtype are measured in one block). Finally, the question remains of what time means in the context of a microarray experiment. Besides the biological variation in the data, there are a multitude of different parameters which can systematically alter the final hybridization intensities. The experimental series should stay constant with regard to all of them. In our experience the best initial choice is to order the chips by their date of hybridization, which shows a very high correlation to most parameters of interest.

Although it is certainly interesting to look how single hybridization experiments $m_i$ compare to the HDS, we are more interested in how the general behavior of the chip process changes over time. Therefore we define the current data set (CDS) (also referred to above as the test data set) as $\{m_{i-N_{CDS}/2}, \ldots, m_i, \ldots, m_{i+N_{CDS}/2}\}$, where $i$ is the time of interest. This allows us to look at the data distribution in a time interval of size $N_{CDS}$ around $i$. In analogy to the classical setting in statistical process control we can define the T²-distance between the HDS and the CDS as in Equation 9:

$$T_w^2(i) = (\mu_{HDS} - \mu_{CDS})^T \bar{S}^{-1} (\mu_{HDS} - \mu_{CDS}),$$

where $\bar{S}$ is calculated from the sample covariance matrices $S_{HDS}$ and $S_{CDS}$ as Equation 10:

$$\bar{S} = \frac{(N_{HDS} - 1) S_{HDS} + (N_{CDS} - 1) S_{CDS}}{N_{HDS} + N_{CDS} - 2}.$$
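
Equations 9 and 10 can be sketched as follows, assuming NumPy. The function names and the synthetic HDS are hypothetical; the CDS is constructed as an exact translation of the HDS so that the result can be verified by hand.

```python
import numpy as np

def pooled_covariance(S_hds, S_cds, n_hds, n_cds):
    """Equation 10: pooled sample covariance of HDS and CDS."""
    return ((n_hds - 1) * S_hds + (n_cds - 1) * S_cds) / (n_hds + n_cds - 2)

def t2_w(hds, cds):
    """Equation 9: distance between the HDS and CDS means in pooled-covariance units."""
    diff = hds.mean(axis=0) - cds.mean(axis=0)
    S = pooled_covariance(np.cov(hds, rowvar=False), np.cov(cds, rowvar=False),
                          len(hds), len(cds))
    return float(diff @ np.linalg.solve(S, diff))

rng = np.random.default_rng(0)
hds = rng.normal(size=(100, 3))
shift = np.array([2.0, 0.0, 0.0])
cds = hds + shift                      # pure translation: same covariance, moved mean

d2 = t2_w(hds, cds)
expected = float(shift @ np.linalg.solve(np.cov(hds, rowvar=False), shift))
```

For a pure translation the pooled covariance equals the HDS covariance, so the statistic reduces to the squared shift measured in units of the data's own spread.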

Although it is possible to use the $T_w^2$-distance between the historical and current data set to test for $\mu_{HDS} = \mu_{CDS}$, this information is relatively meaningless. The hypothesis that the means of HDS and CDS are equal would almost always be rejected, due to the high power of the test. What is of more interest is $T_w$ itself, which is the amount by which the two sample means differ in relation to the standard deviation of the data.

In order to see whether an observed change of the $T_w^2$-distance comes from a simple translation it is also interesting to compare the two sample covariances $S_{HDS}$ and $S_{CDS}$. A translation in log(CG/TG) space means that the hybridization intensities of HDS and CDS differ only by a constant factor (e.g. a change in probe concentration). This situation can be detected by looking at

$$L(i) = 2\left[\ln|\bar{S}| - \frac{N_{HDS} - 1}{N_{HDS} + N_{CDS} - 2}\ln|S_{HDS}| - \frac{N_{CDS} - 1}{N_{HDS} + N_{CDS} - 2}\ln|S_{CDS}|\right],$$

which is the test statistic of the likelihood ratio test for different covariance matrices. It gives a distance measure between the two covariance matrices (i.e. L = 0 means equal covariances).
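
The L-distance can be sketched with `numpy.linalg.slogdet`, which returns the log-determinant without overflow for small or large determinants. The function name and the toy covariance matrices are assumptions of this example.

```python
import numpy as np

def covariance_distance(S_hds, S_cds, n_hds, n_cds):
    """L statistic: 0 for equal covariances, positive otherwise."""
    dof = n_hds + n_cds - 2
    S_pooled = ((n_hds - 1) * S_hds + (n_cds - 1) * S_cds) / dof

    def logdet(S):
        return np.linalg.slogdet(S)[1]          # natural log of |det S|

    return 2.0 * (logdet(S_pooled)
                  - (n_hds - 1) / dof * logdet(S_hds)
                  - (n_cds - 1) / dof * logdet(S_cds))

S_ref = np.eye(2)
L_same = covariance_distance(S_ref, S_ref, 50, 50)          # identical covariances
L_diff = covariance_distance(S_ref, 4.0 * S_ref, 50, 50)    # variance inflated fourfold
```

A translation of the CDS leaves its covariance untouched, so L stays at 0 while $T_w^2$ grows; a change in noise level or correlation structure raises L.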

Before we can apply the described methods to a real microarray data set we again have to solve the problem that we need a non-singular and outlier resistant estimate of $S_{HDS}$ and $S_{CDS}$. What makes the problem even harder is that we cannot know a priori how a change in experimental conditions will affect our data. In contrast to the last section, the simple approximation of $S_{HDS}$ by its first principal components will not work here. The reason is that changes in the experimental conditions outside the HDS will not necessarily be represented in the first principal components of $S_{HDS}$.

The solution is to first embed all the experimental data into a lower dimensional space by PCA. This works because any significant change in the experimental conditions will be captured by one of the first principal components. $S_{HDS}$ and $S_{CDS}$ can then be reliably computed in the lower dimensional embedding. The problem of robustness is simply solved by first using robust PCA to remove outliers before performing the actual embedding and before computing the sample covariances. A summary of our algorithm is:

-   1. Order chips according to the parameter of interest, e.g. date of hybridisation.
-   2. Take the set of ordered chips $\{m_1, \ldots, m_{N_C}\}$, remove outliers with rPCA before computing the first $d$ eigenvectors with classical PCA.
-   3. Project the set of all ordered chips $\{m_1, \ldots, m_{N_C}\}$ into the $d$-dimensional subspace spanned by the computed vectors.
-   4. Select the first $N_{HDS}$ chips $\{m_1, \ldots, m_{N_{HDS}}\}$ as historical data set, remove outliers with rPCA before computing $\mu_{HDS}$ and $S_{HDS}$.
-   5. For every time index $i \in \{1, \ldots, N_C\}$:
    -   (a) Compute the T²-distance between $m_i$ and $\mu_{HDS}$.
    -   (b) If $N_{CDS}/2 < i < N_C - N_{CDS}/2$:
        -   i. Select $\{m_{i-N_{CDS}/2}, \ldots, m_i, \ldots, m_{i+N_{CDS}/2}\}$ as current data set, remove outliers with rPCA before computing $\mu_{CDS}$ and $S_{CDS}$.
        -   ii. Compute the $T_w^2$-distance between $\mu_{HDS}$ and $\mu_{CDS}$.
        -   iii. Compute the L-distance between $S_{HDS}$ and $S_{CDS}$.
-   6. Generate the control chart by plotting T², $T_w^2$ and L.
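
The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the disclosed implementation: the rPCA outlier filter is replaced by a simple median/MAD rule (`drop_outliers`, a hypothetical stand-in), the chips are assumed to be ordered already, and the data are synthetic with a mean shift planted after chip 60.

```python
import numpy as np

def drop_outliers(X, cutoff=3.5):
    """Stand-in for the rPCA filter (assumption): keep rows whose largest
    MAD-standardised coordinate deviation is below `cutoff`."""
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0) + 1e-12
    return X[(np.abs(X - med) / mad).max(axis=1) < cutoff]

def logdet(S):
    return np.linalg.slogdet(S)[1]

def control_chart(M, d, n_hds, n_cds):
    """M: chips x CpG matrix with rows in time order (step 1)."""
    # Step 2: classical PCA on the outlier-cleaned data.
    clean = drop_outliers(M)
    mu = clean.mean(axis=0)
    _, _, Vt = np.linalg.svd(clean - mu, full_matrices=False)
    # Step 3: project all chips onto the first d eigenvectors.
    Z = (M - mu) @ Vt[:d].T
    # Step 4: historical data set.
    hds = drop_outliers(Z[:n_hds])
    mu_h = hds.mean(axis=0)
    S_h = np.atleast_2d(np.cov(hds, rowvar=False))
    n, half = len(Z), n_cds // 2
    t2 = np.empty(n)
    t2_w = np.full(n, np.nan)
    L = np.full(n, np.nan)
    for i in range(n):                                   # step 5
        diff = Z[i] - mu_h
        t2[i] = diff @ np.linalg.solve(S_h, diff)        # step 5(a)
        if half <= i < n - half:                         # step 5(b), 0-based index
            cds = drop_outliers(Z[i - half:i + half + 1])
            mu_c = cds.mean(axis=0)
            S_c = np.atleast_2d(np.cov(cds, rowvar=False))
            dof = len(hds) + len(cds) - 2
            S_p = ((len(hds) - 1) * S_h + (len(cds) - 1) * S_c) / dof
            dm = mu_h - mu_c
            t2_w[i] = dm @ np.linalg.solve(S_p, dm)
            L[i] = 2.0 * (logdet(S_p) - (len(hds) - 1) / dof * logdet(S_h)
                          - (len(cds) - 1) / dof * logdet(S_c))
    return t2, t2_w, L                                   # step 6: values to plot

# Synthetic series: 120 chips, 6 CpG positions, mean shift after chip 60.
rng = np.random.default_rng(0)
M = rng.normal(size=(120, 6))
M[60:, 0] += 3.0
t2, t2_w, L = control_chart(M, d=2, n_hds=40, n_cds=20)
```

Plotting the three returned arrays against the chip index gives the control chart; in this toy series the $T_w^2$ curve stays near zero for windows before the shift and rises once the window lies entirely after it.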

With the computed values for T², $T_w^2$ and L we can generate a plot that visualizes the quality development of the chip process over time, a so-called T² control chart.

Results

The first example is shown in FIG. 4, which demonstrates how our algorithm detects a change in hybridization temperature. As can be expected the T²-value grows with an increase in hybridization temperature. The systematic increase of the L-distance indicates that this is not only caused by a simple translation in methylation space. The process has to be regarded as clearly out of control, due to the observation that almost all chips are above the UCL after the temperature change and the process center has drifted more than $T_w = 4$ standard deviations away from its original location.

FIG. 6 shows how our method detects the simulated handling error in the Lymphoma data set. The affected chips can be clearly identified by the significant increase in the T²-distances as well as by their change in the covariance structure.

Finally, FIG. 5 shows the T² control chart of the ALL/AML study. It clearly indicates that the experimental conditions significantly changed two times over the course of the study. A look at the L-distance reveals that the covariance within the two detected artefact blocks is identical to the HDS. A change in covariance can be detected only when the CDS window passes the two borders. This clearly indicates that the observed effect is a simple translation of the process mean.

The major practical problem is now to identify the reasons for the changes. In this regard the most valuable information from the T² control chart is the time point of process change. It can be cross-checked with the laboratory protocol and the process parameters which have changed at the same time can be identified. In our case the two process shifts corresponded to the time of replacement of re-synthesized probe oligos for slide production, which were obviously delivered at a wrong concentration. After exclusion of the affected CpG positions from the analysis the T² chart showed normal behavior and the overall noise level of the data set was significantly reduced.

Discussion

Taken together, we have shown that robust principal component analysis and techniques of statistical process control can be used to detect flaws in microarray experiments. Robust PCA has proven to be able to automatically detect nearly all cases of outlier chips identified by visual inspection, as well as microarrays with inconspicuous image quality but saturated hybridization signals. With the T² control chart we introduced a tool that facilitates the detection and assessment of even minor systematic changes in large scale microarray studies.

A major advantage of both methods is that they do not rely on an explicit modeling of the microarray process as they are solely based on the distribution of the actual measurements. Having successfully applied our methods to the example of DNA methylation data, we assume that the same results can be achieved with other types of microarray platforms. The sensitivity of the methods improves with increasing study sizes, due to their multivariate nature. This makes them particularly suitable for medium to large scale experiments in a high throughput environment.

The retrospective analysis of a study with our methods can greatly improve results and avoid misleading biological interpretations. When the T² control chart is monitored in real time a given quality level can be maintained in a very cost effective way. On the one hand, this allows for an immediate correction of process parameters. On the other hand, this makes it possible to specifically repeat only those slides affected by a process artefact. This guarantees high quality while minimizing the number of repetitions.

A general shortcoming of T² control charts is that they only indicate that something went wrong, but not what exactly the source was. Therefore we have used the time at which a significant change happened in order to identify the responsible process parameter. We have shown how a quantification of the change in covariance structure provides additional information and makes it possible to discriminate between different problems like changes in probe concentration and accidental handling errors.

Example 2

In one aspect, the method according to the disclosed invention provides a means for automatically generating a concise report based on the disclosed methods for quality monitoring of laboratory process performance. In the disclosed embodiment this report is structured in sections: a summary table (see Table 1) of the performance grades for several evaluation categories of the individual experiment units; a section detailing each evaluation category in turn, with a table of grades for this category, the corresponding performance variables the grades are based on and a set of graphical displays implemented as a panel of box plots (see FIG. 7) displaying the thresholds used for grading; and a table of details containing all evaluation grades for each experimental unit. The report can be generated by means of a computer program which outputs the result in file formats HTML, Adobe PDF, PostScript, and variants thereof.

TABLE 1

Chip                    Vis. Grade  Rob. PCA - Thr.  BG       SPOT     GEO      SAT
----------------------  ----------  ---------------  -------  -------  -------  ----
0100870030-68406-57115  3           −0.9             bad      good     bad      good
0100870296-68421-57110  2           −1.5             bad      good     bad      good
0100870569-68422-57121  2           −2.7             bad      good     bad      good
0100870907-68447-57105  2           −2.              bad      good     bad      good
0100870949-68451-57127  2           −1.8             dubious  good     bad      good
0100871228-68460-57104  2           −1.9             dubious  good     bad      good
0100871947-68487-57109  1           −1.6             dubious  good     bad      good
0100871997-68491-57128  2           −2.1             bad      good     bad      good
0100872531-68503-57103  6           5.6              bad      good     good     good
0100872549-68495-57112  1           2.3              bad      good     bad      good
0100872573-68504-57129  2           −0.2             bad      good     bad      good
0100872812-68517-57106  2           −1.4             dubious  good     bad      good
0100870056-68408-57133  3           −1.8             bad      good     bad      good
0100870072-68410-57139  3           −2.1             bad      good     bad      good

Table 1 shows the summary table of category grades for each experimental unit. From left to right, the columns represent the identifier of the experimental unit, the human expert visual grade, the distance of the experimental unit from the estimate of the robust mean location of the set of experiments, the background category grade, the spot characteristic category grade, the geometry characteristic grade and the intensity saturation category grade. Three grade levels are used (good, dubious, bad), based on the grades calculated for each category in turn.

Table 2 shows the complete summary table of all chips analysed in study ‘1’ according to FIG. 7, of which Table 1 represents the most informative subset.

TABLE 2

Chip                    Vis. Grade  Rob. PCA - Thr.  BG       SPOT     GEO      SAT
----------------------  ----------  ---------------  -------  -------  -------  ----
0100870030-68406-57115  3           −0.9             bad      good     bad      good
0100870296-68421-57110  2           −1.5             bad      good     bad      good
0100870569-68422-57121  2           −2.7             bad      good     bad      good
0100870907-68447-57105  2           −2               dubious  good     bad      good
0100870949-68451-57127  2           −1.8             dubious  good     bad      good
0100871228-68460-57104  2           −1.9             dubious  good     bad      good
0100871947-68487-57109  1           −1.6             dubious  good     bad      good
0100871997-68491-57128  2           −2.1             bad      good     bad      good
0100872531-68503-57103  6           5.6              bad      good     good     good
0100872549-68495-57112  1           2.3              bad      good     bad      good
0100872573-68504-57129  2           −0.2             bad      good     bad      good
0100872812-68517-57106  2           −1.4             dubious  good     bad      good
0100870056-68408-57133  3           −1.8             bad      good     bad      good
0100870072-68410-57139  3           −2.1             bad      good     bad      good
0100870098-68412-57145  3           −1.2             good     good     bad      good
0100870171-68417-57183  3           −1.3             good     good     bad      good
0100870402-68426-57164  2           −2.2             dubious  good     bad      good
0100870527-68437-57107  2           −2.6             bad      good     bad      good
0100870600-68439-57146  2           −0.8             bad      good     bad      good
0100870642-68442-57165  3           −1.5             bad      good     bad      good
0100870725-68444-57185  2           −0.7             bad      good     good     good
0100870923-68449-57117  3           −2.5             dubious  good     bad      good
0100870965-68453-57140  2           −1               dubious  good     bad      good
0100870981-68438-57143  2           −1.5             dubious  good     bad      good
0100871004-68441-57153  2           −1.8             dubious  good     bad      good
0100871020-68455-57166  2           −2.4             good     good     bad      good
0100871046-68443-57172  2           −1.9             bad      good     bad      good
0100871062-68445-57180  2           −1.1             bad      good     bad      good
0100871301-68464-57141  2           −1.8             good     good     bad      good
0100871343-68467-57160  2           −1.5             dubious  good     bad      good
0100871632-68478-57119  2           −2.1             bad      good     bad      good
0100871674-68468-57136  2           −2               dubious  good     bad      good
0100871757-68470-57157  3           −1.9             bad      good     bad      good
0100871799-68482-57167  2           −0.7             bad      good     bad      good
0100871822-68483-57176  3           −2.1             good     good     bad      good
0100871830-68472-57179  3           −1.2             dubious  good     bad      good
0100872185-68484-57138  3           −0.1             bad      good     bad      good
0100872226-68492-57149  3           −0.8             dubious  good     bad      good
0100872268-68494-57154  3           −2.1             dubious  good     bad      good
0100872309-68494-57168  2           −2               dubious  good     bad      good
0100872341-68488-57174  2           −1.5             dubious  good     bad      good
0100872383-68496-57187  2           −1.1             bad      good     dubious  good
0100872581-68506-57142  2           −0.6             bad      good     bad      good
0100872614-68508-57150  2           −2               bad      good     bad      good
0100872622-68498-57152  2           −1.4             bad      good     bad      good
0100872656-68510-57169  2           −2.7             bad      good     bad      good
0100872664-68512-57175  2           −1.9             bad      good     bad      good
0100872698-68500-57181  2           −2               bad      good     bad      good
0100872820-68509-57113  2           −1.2             bad      good     bad      good
0100872854-68511-57132  2           −1.9             bad      good     bad      good
0100872896-68514-57137  2           2.7              bad      good     bad      good
0100872903-68519-57151  2           −2               bad      good     bad      good
0100872937-68516-57155  2           −1.9             bad      good     bad      good
0100872979-68521-57178  2           −0.8             dubious  good     bad      good
0100873068-68405-57182  3           −2.9             dubious  good     bad      good
0100870212-68403-57198  3           −1.7             bad      good     bad      good
0100870246-68559-57265  3           −0.4             dubious  good     bad      good
0100870254-68404-57216  3           −1.7             bad      good     good     good
0100870288-68527-57233  2           −2.2             good     good     bad      good
0100870329-68529-57235  3           −0.3             bad      good     bad      good
0100870361-68555-57261  3           −1.5             dubious  good     bad      good
0100870444-68432-57195  3           −1               bad      good     bad      good
0100870452-68433-57204  3           1.9              bad      good     bad      good
0100870486-68418-57215  2           −2.7             dubious  good     bad      good
0100870759-68528-57234  3           −2.2             bad      good     bad      good
0100870767-68429-57191  2           −2.1             dubious  good     bad      good
0100870791-68531-57237  3           −2               bad      good     bad      good
0100870808-68431-57197  2           −0.8             dubious  good     bad      good
0100870832-68533-57239  3           −1.3             dubious  good     bad      good
0100870840-68434-57208  2           −1.5             dubious  good     bad      good
0100870866-68446-57220  2           −3.2             dubious  good     bad      good
0100870882-68436-57223  2           −2.1             bad      good     bad      good
0100870915-68536-57242  2           −1.3             dubious  good     bad      good
0100870957-68532-57238  3           3.2              bad      good     bad      good
0100870999-68538-57244  3           −1.2             bad      good     bad      good
0100871070-68543-57249  3           −2.3             dubious  good     bad      good
0100871088-68448-57190  3           −0.8             dubious  good     bad      good
0100871103-68450-57199  4           −1.2             bad      good     good     good
0100871129-68452-57210  3           −2.5             bad      good     bad      good
0100871145-68535-57241  3           −1.9             bad      good     bad      good
0100871161-68457-57221  2           −1.5             bad      good     bad      good
0100871187-68547-57253  2           −2.2             dubious  good     bad      good
0100871195-68550-57256  2           −0.8             good     good     bad      good
0100871236-68537-57266  3           −0.3             bad      good     dubious  good
0100871260-68552-57258  2           −2.4             dubious  good     bad      good
0100871278-68539-57245  3           −1.1             dubious  good     bad      good
0100871335-68541-57247  2           −0.9             dubious  good     bad      good
0100871385-68542-57248  4           0.5              bad      good     good     good
0100871468-68461-57200  4           −2.2             good     good     bad      good
0100871476-68548-57254  2           −1.5             bad      good     bad      good
0100871517-68558-57264  2           −3.1             dubious  good     bad      good
0100871559-68463-57217  3           −2.7             dubious  good     bad      good
0100871591-68475-57228  2           −2.1             dubious  good     bad      good
0100871864-68485-57196  2           −2.4             bad      good     bad      good
0100871872-68474-57201  2           −1.1             bad      good     bad      good
0100871898-68477-57209  2           −2.5             dubious  good     bad      good
0100871905-68479-57218  3           −2.6             bad      good     bad      good
0100871913-68481-57225  2           −0.7             dubious  good     bad      good
0100872101-68551-57257  3           −2.4             dubious  good     bad      good
0100872143-68553-57259  3           −2.1             bad      good     good     good
0100872424-68490-57188  3           −1               dubious  good     bad      good
0100872458-68497-57205  3           −1.9             bad      good     bad      good
0100872466-68499-57213  2           −0.9             bad      good     bad      good
0100872490-68493-57219  2           −0.7             bad      good     bad      good
0100872507-68501-57229  3           −0.9             bad      good     bad      good
0100872705-68502-57193  3           −2.5             bad      good     bad      good
0100872739-68505-57202  2           −2               bad      good     bad      good
0100872747-68513-57214  2           −0.7             bad      good     bad      good
0100872771-68515-57222  2           −1.8             dubious  good     bad      good
0100872789-68524-57230  2           −0.2             bad      good     bad      good
0100872862-68560-57267  2           −0.5             bad      good     bad      good
0100872987-68526-57232  2           −3.1             bad      good     bad      good
0100873183-68401-57207  4           −2.2             bad      good     bad      good
0100870022-68703-57410  3           −1.2             good     good     bad      good
0100870048-68704-57411  5           −0.8             bad      good     bad      good
0100870080-68562-57271  3           −2.6             dubious  good     bad      good
0100870105-68701-57408  3           −0.9             bad      good     bad      good
0100870121-68564-57273  3           −1.5             dubious  good     bad      good
0100870147-68699-57406  3           −1.3             bad      good     bad      good
0100870163-68563-57269  4           −0.8             dubious  good     bad      bad
0100870189-68700-57407  3           0.4              bad      good     bad      good
0100870204-68565-57268  3           −1.1             bad      good     bad      good
0100870775-68696-57403  3           −2.7             dubious  good     bad      good
0100870816-68698-57405  3           −0.7             bad      good     bad      good
0100870858-68697-57404  3           −1.7             dubious  good     bad      good
0100870890-68575-57281  3           −0.7             bad      good     bad      good
0100870931-68576-57283  3           1                bad      good     good     good
0100871012-68691-57398  3           −1.6             dubious  good     bad      good
0100871054-68692-57399  3           −2.2             bad      good     bad      good
0100871096-68638-57345  2           −0.6             good     good     bad      good
0100871137-68636-57343  2           −2.4             bad      good     bad      good
0100871179-68650-57357  5           −0.9             dubious  dubious  bad      good
0100871210-68706-57413  3           −1.6             bad      good     bad      good
0100871252-68649-57356  3           −2               dubious  good     bad      good
0100871294-68635-57342  3           −1               bad      good     bad      good
0100871418-68615-57322  2           −1.2             bad      good     bad      good
0100871450-68678-57385  2           −2               dubious  good     bad      good
0100871492-68677-57384  2           −0.7             dubious  good     bad      good
0100871533-68676-57383  5           −2.1             dubious  good     bad      good
0100871541-68645-57352  3           −0.4             dubious  good     bad      good
0100871583-68643-57350  3           −1.7             bad      good     bad      good
0100871624-68644-57351  2           −2.3             dubious  good     bad      good
0100871666-68642-57349  5           −1.5             bad      good     bad      good
0100871707-68641-57348  2           −2.9             dubious  good     bad      good
0100871731-68571-57277  3           −3.4             bad      good     bad      good
0100871773-68572-57278  2           3.6              bad      good     bad      good
0100871781-68675-57382  4           −1.7             bad      good     bad      good
0100871814-68573-57282  3           −0.7             bad      good     bad      good
0100871856-68574-57280  2           −0.1             bad      good     bad      good
0100871939-68561-57270  2           −2               bad      good     bad      good
0100871971-68569-57276  4           −1.7             bad      good     bad      good
0100871989-68570-57279  2           −2.3             bad      good     bad      good
0100872135-68651-57358  3           −2.3             dubious  good     bad      good
0100872177-68652-57359  3           −1.4             bad      good     bad      good
0100872218-68653-57360  4           −2.3             dubious  dubious  bad      good
0100872250-68654-57361  4           −0.6             good     good     bad      good
0100872292-             3           −2.2             dubious  good
badgood 68655-57362 0100872333- 4 −0.5 bad dubious bad good 68656-573630100872375- 3 −0.6 good good bad good 68656-57397 0100872416- 3 −1.3 badgood bad good 68689-57396 0100873018- 3 −1.3 bad good bad good68601-57308 0100873026- 3 −2.4 dubious good bad good 68602-573090100873050- 5 −1.5 bad good bad good 68659-57366 0100873076- 3 −1.2 badgood bad good 68578-57285 0100873084- 5 −1.4 bad good bad good68664-57371 0100873117- 3 −1.6 bad good bad good 68581-57288 0100873133-3 −1.8 bad good bad good 68679-57386 0100873159- 3 −1.6 bad good badgood 68580-57287 0100873175- 2 −2.4 bad good bad good 68681-573880100873191- 2 −1.4 bad good bad good 68630-57337 0100873216- 2 −2.3 badgood bad good 68682-57389 0100873224- 3 −1.6 dubious good bad good68627-57334 0100873232- 2 −0.3 bad good bad good 68629-57336 0100873258-2 −2.1 dubious good bad good 68684-57391 0100873266- 3 −1.3 bad good badgood 68628-57335 0100873274- 3 −1.1 bad good bad good 68631-573380100873290- 3 −2 dubious good bad good 68683-57390 0100873307- 3 −1.8bad good bad good 68625-57332 0100873315- 2 −0.4 bad good bad good68586-57293 0100873331- 3 −1.9 bad good bad good 68686-57393 0100873349-4 −2.3 dubious good bad good 68626-57333 0100873357- 2 −1.3 bad good badgood 68585-57292 0100873373- 5 −1.5 bad dubious bad good 68685-573920100873381- 2 −2.7 bad good bad good 68639-57346 0100873399- 3 −1.8 badgood bad good 68589-57296 0100873414- 2 −0.4 bad good dubious good68687-57394 0100873422- 2 −2.2 bad good bad good 68624-57331 0100873430-3 −1.1 bad good bad good 68587-57294 0100873456- 3 −1.3 bad good badgood 68688-57395 0100873464- 3 −1.6 dubious good bad good 68666-573730100873472- 6 −2 dubious good bad good 68640-57347 0100873498- 2 −2.5bad good bad good 68665-57372 0100873505- 2 −2.2 bad good bad good68667-57374 0100873513- 3 −1.2 good good bad good 68588-572950100873539- 2 −1.6 bad good bad good 68596-57303 0100873547- 3 −1.3 badgood bad good 68647-57354 0100873555- 4 −0.7 bad good bad good68590-57297 0100873571- 
2 −1 bad good bad good 68598-57305 0100873589- 5−2.3 bad good bad good 68648-57355 0100873612- 3 0 bad good bad good68597-57304 0100873646- 3 −0.4 bad good bad good 68595-57302 0100873654-4 −2.7 bad good bad good 68600-57307 0100873662- 2 −1.8 dubious good badgood 68669-57376 0100873696- 2 −2.1 bad good bad good 68599-573060100873703- 2 −1.7 dubious good bad good 68670-57377 0100873737- 4 −1.4bad good bad good 68582-57289 0100873745- 4 −0.8 dubious good bad good68671-57378 0100873779- 3 −2.6 bad good bad good 68583-57290 0100873787-5 −2 bad good bad good 68672-57379 0100873810- 3 −1.8 dubious good badgood 68584-57291 0100873828- 2 −1.8 dubious good bad good 68657-573640100873852- 3 −2.2 bad good bad good 68607-57314 0100873860- 2 −1.1 badgood dubious good 68662-57418 0100873894- 3 −2.6 dubious good bad good68605-57312 0100873901- 2 −1.9 bad good bad good 68637-57344 0100873935-2 −0.8 dubious good bad good 68606-57313 0100873943- 3 −2.2 bad good badgood 68577-57284 0100873969- 3 −0.8 dubious good bad good 68661-573680100873977 3 −1.8 bad good bad good 68604-57311 0100873985- 2 −1.9 badgood bad good 68591-57298 0100874008- 2 −1.7 bad good bad good68663-57370 0100874016- 3 −2.3 bad good bad good 68603-57310 0100874024-4 −1.1 bad good bad good 68579-57286 0100874040- 2 −0.7 bad good badgood 68673-57380 0100875147- 4 −1.8 bad good bad good 68717-574260100875345- 3 −1.8 bad good bad good 68719-57428 0100875387- 3 −1.6dubious good bad good 68720-57429 0100875428- 3 −0.1 dubious good badgood 68716-57425 0100874157- 2 −1.5 bad good bad good 68787-575180100874404- 3 3.3 bad good bad good 68773-57504 0100874446- 3 −2.5 badgood bad good 68771-57502 0100874488- 2 −1.5 bad good bad good68800-57531 0100874529- 2 0 bad good bad good 68796-57527 0100874553- 2−1.9 bad good bad good 68792-57523 0100874561- 3 −1.6 bad good bad good68798-57529 0100874595- 2 −0.5 bad good bad good 68794-57525 0100874602-3 −1.4 bad good bad good 68775-57506 0100874628- 3 −2.7 bad good badgood 68808-57543 
0100874636- 2 −0.9 bad good bad good 68788-575190100874678- 3 −2.3 bad good bad good 68791-57522 0100875098- 2 −1.8 badgood bad good 68721-57431 0100875121- 2 −1.2 bad good bad good68735-57451 0100875139- 2 −1 bad good bad good 68723-57443 0100875163- 2−1.3 bad good bad good 68733-57454 0100875171- 2 −1.9 bad good bad good68768-57499 0100875204- 2 3.8 bad good bad good 68732-57452 0100875212-3 −2.5 bad good bad good 68767-57498 0100875246- 2 −1.2 bad good badgood 68730-57453 0100875254- 2 −2.5 bad good bad good 68765-574890100875288- 2 −2.1 bad good bad good 68728-57448 0100875296- 4 −2.2dubious good bad good 68815-57550 0100875379- 3 −1.7 dubious good badgood 68763-57482 0100875410- 3 −2.9 good good bad good 68762-574810100875452- 3 −1.2 good good bad good 68810-57544 0100875494- 2 −2.1dubious good bad good 68759-57478 0100875535- 2 −1.7 bad good bad good68811-57545 0100875577- 3 −2.2 dubious good bad good 68814-575470100875593- 3 −1.9 bad good bad good 68785-57516 0100875618- 2 −1.7dubious good bad good 68756-57475 0100875759- 3 −0.8 bad good bad good68776-57507 0100875791- 2 −3.3 bad good bad good 68774-57505 0100875816-3 −0.3 bad good bad good 68738-57457 0100875824- 3 −1.1 dubious good badgood 68769-57500 0100875832- 2 −2.4 dubious good bad good 68772-575030100875858- 2 −1.5 bad good bad good 68739-57458 0100875866- 2 −1.3 badgood bad good 68726-57446 0100875915- 3 −1.4 dubious good bad good68742-57461 0100875957- 3 −2 bad good bad good 68741-57460 0100875999- 3−1.5 bad good bad good 68740-57459 0100876012- 3 −0.1 bad good bad good68737-57456 0100876038- 2 −2.3 bad good bad good 68755-57474 0100876054-3 −1.5 bad good bad good 68736-57455 0100876070- 2 −2 bad good bad good68813-57546 0100876096- 3 −2.3 bad good bad good 68784-57515 0100876103-3 0.2 bad good bad good 68745-57464 0100876137- 3 −2.1 bad good bad good68786-57517 0100876145- 3 −0.7 bad good bad good 68746-57465 0100876179-2 −1.7 bad good bad good 68780-57511 0100876187- 3 −1.3 bad good badgood 
68747-57466 0100876210- 3 −1.1 bad good bad good 68812-575490100876228- 3 −1.9 bad good bad good 68748-57467 0100876252- 3 −1.6 badgood bad good 68777-57508 0100876260- 3 −1.6 bad good bad good68749-57468 0100876301- 3 −0.8 dubious good bad good 68750-574690100876335- 3 −3.4 bad good bad good 68790-57521 0100876343- 3 −1 badgood bad good 68754-57473 0100876377- 3 −1.2 bad good bad good68793-57524 0100876418- 3 −3.2 bad good bad good 68795-57526 0100876450-3 −2.2 bad good bad good 68797-57528 0100876492- 3 −0.9 bad good badgood 68799-57530 0100876533- 3 −1.9 bad good bad good 68801-575320100876575- 3 −1.4 bad good bad good 68802-57533 0100876616- 3 −1.6dubious good bad good 68751-57470 0100876690- 3 −1.7 dubious good badgood 68743-57462 0100876773- 3 −1.1 bad good bad good 68803-575340100876814- 3 −0.3 bad good bad good 68805-57536 0100876856- 3 −0.3 badgood bad good 68804-57535 0100876898- 3 −1.9 bad good bad good68807-57538 0100876939- 3 −2.6 bad good bad good 68807-57538 0100877052-3 −0.6 bad good bad good 68744-57463 — — bad 198 (57.9%)  0 (0%) 290(84.8%)  1 (0.3%) — — dubious 125 (36.5%)  4 (1.2%)  15 (4.4%)  0 (0%) —— good  19 (5.6%) 338 (98.8%)  37 (10.8%) 341 (99.7%)

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1: Typical artefacts in microarray based hybridisation signals. The plots show the correlation between single or averaged hybridisation profiles. ‘A’ shows a typical chip classified as “good”. The small random deviations from the sample median are due to the approximately normally distributed experimental noise. A typical chip classified as “unacceptable” by visual inspection is shown in ‘B’. Many spots showed no signal, resulting in a log ratio of=after thresholding the signals to X>0. The opposite case is shown in ‘C’. This chip has very strong hybridization signals and was classified as “good” by visual inspection. However, the hybridization conditions were too unspecific and most of the oligos were saturated. ‘D’ shows a chip classified as “acceptable”. Hybridization signals were weak compared to background intensity, resulting in a high amount of noise. ‘E’ shows the comparison of group averages over 64 chips in a study hybridised at 42° C. and 48 chips from the same study hybridised at 44° C. ‘F’ shows the comparison of group averages over 447 regular chips from one study and 200 chips with a simulated accidental probe exchange during slide production affecting 12 positions on the chip.

FIG. 2: Comparison between univariate (central rectangle) and multivariate (ellipse) upper confidence intervals. P₁ is not detected as an outlier by the univariate t_(k)-distance, but is detected by the multivariate T²-statistic. P₂ is erroneously detected as an outlier by the univariate t_(k)-distance, but not by the multivariate T²-statistic. For P₃ (non-outlier) and P₄ (outlier) both methods give the same decision.

FIG. 3: T̃²-distances of robust PCA versus classical PCA for the Lymphoma dataset. The T̃_(UCL)² values are shown as two dotted lines. Chips to the right of the vertical line were detected as outliers by robust PCA. Chips above the horizontal line were detected as outliers by classical PCA. Chips classified as ‘unacceptable’ by visual inspection are shown as squares, ‘acceptable’ chips as triangles and ‘good’ chips as crosses. Note that ‘good’ chips detected as outliers by rPCA have all been confirmed to show saturated hybridization signals. The T̃_(UCL)² values are calculated with d=10 and significance level α=0.025.

FIG. 4: T² control chart of ALL/AML study. Over the course of the experiment a total of 46 oligomers for 35 different CpG positions had to be re-synthesized. Oligos were replaced at time indices 234 and 315. The upper plot shows the T-distance of 433 hybridizations, where the grey curve shows the running average as computed by a lowess fit. The lower plot shows the T_(w) and L-distance between HDS and CDS with a window size of N_(HDS)=N_(CDS)=75.

FIG. 5: T² control chart of simulated probe exchange in the Lymphoma data set. Between chips 300 and 500 an accidental oligo probe exchange during slide production was simulated by rotating 12 randomly selected CpG positions. The upper plot shows the T-distance of all 647 hybridisations, where the line of the curve shows the running average as computed by a lowess fit. Triangular points are chips classified as ‘unacceptable’ by visual inspection. The lower plot shows the T- and L-distance between HDS and CDS with a window size of N_(HDS)=N_(CDS)=75.

FIG. 6: T² control chart of temperature experiment. The same ALL/AML samples were hybridized at 4 different temperatures. The upper plot shows the T-distance of all 207 hybridizations to the HDS, where the line of the curve shows the running average as computed by a lowess fit. The lower plot shows the T_(w) and L-distance between HDS and CDS with a window size of N_(HDS)=N_(CDS)=30.

FIG. 7: A panel of box plots, wherein the experimental series described according to Example 2 corresponds to box plot ‘1’. The variable distribution summarized is the 75% quantiles of the standard deviations of the per spot percentage of pixels that surpass the per spot one standard deviation about the mean of all pixel values threshold. The lower horizontal line displays the 75% quantile and the 95% quantile of this distribution calculated from the combined five data sets shown in the individual box plots ‘2’ to ‘6’. The thus defined thresholds are used for grading the experimental unit with respect to this single variable.
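
By way of illustration only, the distinction drawn in FIG. 2 between univariate and multivariate limits can be sketched numerically. The reference mean, covariance matrix, per-variable limit and T² limit below are hypothetical values chosen for the example and are not taken from the described experiments; the T² limit of 7.38 is simply the approximate 97.5% point of a chi-square distribution with two degrees of freedom.

```python
import numpy as np

def t2_distance(x, mean, cov):
    """Squared Mahalanobis (Hotelling-type T^2) distance of x from the reference."""
    diff = np.asarray(x, dtype=float) - mean
    return float(diff @ np.linalg.solve(cov, diff))

# Hypothetical reference distribution: two strongly correlated variables.
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])

# Illustrative limits (assumptions, not values from the source).
Z_LIMIT = 2.0    # per-variable limit: the "rectangle"
T2_LIMIT = 7.38  # joint limit: the "ellipse"

def univariate_outlier(x):
    """True if any single variable exceeds its own standardized limit."""
    z = (np.asarray(x, dtype=float) - mean) / np.sqrt(np.diag(cov))
    return bool(np.any(np.abs(z) > Z_LIMIT))

def multivariate_outlier(x):
    """True if the joint T^2 distance exceeds the multivariate limit."""
    return t2_distance(x, mean, cov) > T2_LIMIT

# A point inside the rectangle but against the correlation (like P1 in FIG. 2).
p1 = np.array([1.5, -1.5])
# A point just outside the rectangle but along the correlation (like P2).
p2 = np.array([2.2, 2.0])

for name, p in (("p1", p1), ("p2", p2)):
    print(name, univariate_outlier(p), multivariate_outlier(p))
```

In this sketch the first point is missed by the per-variable check but flagged by the T² check, while the second point shows the opposite behaviour, which is the situation FIG. 2 depicts for P₁ and P₂.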

1. A method of verifying and controlling assays for the analysis of nucleic acid variations by means of statistical process control, characterized in that variables of each experiment are monitored by measuring deviations of said variables from a reference data set and wherein said experiments or batches thereof are indicated as unsuitable for further interpretation if they exceed predetermined limits.
2. A method according to claim 1 wherein said nucleic acid variations are cytosine methylation variations.
3. A method according to claim 1 wherein said statistical process control is taken from the group comprising multivariate statistical process control and univariate statistical process control.
4. A method according to claim 1 comprising the following steps: a) defining a reference data set; b) defining a test data set; c) determining the statistical distance between the reference data set and test data set or elements or subsets thereof; d) identifying individual elements or subsets of the test dataset which have a statistical distance larger than that of a predetermined value.
5. The method according to claim 4, further comprising in step b) reducing the data dimensionality of the reference and test data set by means of robust embedding of the values into a lower dimensional representation.
6. The method according to claim 5 wherein step b) is carried out by calculating the embedding space using one or both of the reference and the test data sets.
7. The method according to claim 4 further comprising, e) further investigating said identified elements or subsets of the test dataset to determine the contribution of individual variables to the determined statistical distance.
8. The method according to claim 4 further comprising, e) excluding said identified experiments or batches thereof from further analysis.
9. The method of claim 4 wherein in step d) said statistical distance is calculated by means of one or more methods taken from the group consisting of the Hotelling's T² distance between a single test measurement vector and the reference data set, the Hotelling's T² distance between a subset of the test data set and the reference data set, the distance between the covariance matrices of a subset of the test data set and the covariance matrix of the reference set, percentiles of the empirical distribution of the reference data set and percentiles of a kernel density estimate of the distribution of the reference data set, distance from the hyperplane of a nu-SVM, estimating the support of the distribution of the reference data set.
10. The method according to claim 5 wherein the data dimensionality reduction is carried out by means of principal component analysis.
11. The method according to claim 5 wherein the data dimensionality reduction step comprises the following steps: i) projecting the data set by means of robust principal component analysis; ii) removing outliers from the data set according to their statistical distances calculated by means of one or more methods taken from the group consisting of: Hotelling's T² distance; percentiles of the empirical distribution of the reference data set; percentiles of a kernel density estimate of the distribution of the reference data set and distance from the hyperplane of a nu-SVM, estimating the support of the distribution of the reference data set; iii) calculating the embedding projection by standard principal component analysis and projecting the cleared or the complete data set onto this basis vector system.
12. The method according to claim 4 wherein at least one of the variables measured in steps a) and b) is determined according to the methylation state of the nucleic acids.
13. The method according to claim 4 wherein at least one of the variables measured in steps a) and b) is determined by the environment used to conduct the assay.
14. The method according to claim 4 wherein said data sets comprise one or more variables selected from the group comprising mean background/baseline values; scatter of the background/baseline values; scatter of the foreground values; geometrical properties of the array; percentiles of background values of each spot; and positive and negative assay control measures.
15. A method according to claim 4 wherein the reference data set is the complete series of experiments being analysed.
16. A method according to claim 4 wherein the reference data set is derived from experiments carried out separately to those of the test data set.
17. A method according to claim 4 wherein the reference data set is derived from a set of experiments wherein the value of each variable of each experiment is either within a predetermined limit or optimally controlled.
18. A method according to claim 4 further comprising the generation of a document comprising said elements or subsets of the test data determined according to step d) of claim 4.
19. A method according to claim 18 wherein said document further comprises the contribution of individual variables to the determined statistical distance.
20. A method according to claim 18 wherein said document is stored in a computer readable format.
21. A method according to one of claims 1 to 20 wherein said method is implemented by means of a computer.
22. A computer program product for verifying and controlling assays for the analysis of nucleic acid variations comprising: a) a computer code that receives as input a reference data set; b) a computer code that receives as input a test data set; c) a computer code that determines the statistical distance between the reference data set and test data set or elements or subsets thereof; d) a computer code that identifies individual elements or subsets of the test dataset which have a statistical distance larger than that of a predetermined value; and e) a computer readable medium that stores the computer code.
23. The computer program product of claim 22 further comprising f) a computer code that reduces the data dimensionality of the reference and test data set by means of robust embedding of the values into a lower dimensional representation.
24. The computer program product of claim 22 characterised in that the embedding space is calculated using one or both of the reference and the test data sets.
25. The computer program product of claims 22 to 24 further comprising, g) a computer code that investigates said identified elements or subsets of the test dataset to determine the contribution of individual variables to the determined statistical distance.
26. The computer program product of claim 22 wherein said statistical distance is calculated by means of one or more methods taken from the group consisting of the Hotelling's T² distance between a single test measurement vector and the reference data set, the Hotelling's T² distance between a subset of the test data set and the reference data set, the distance between the covariance matrices of a subset of the test data set and the covariance matrix of the reference set, percentiles of the empirical distribution of the reference data set and percentiles of a kernel density estimate of the distribution of the reference data set, distance from the hyperplane of a nu-SVM, estimating the support of the distribution of the reference data set.
27. The computer program product of claim 23 wherein the data dimensionality reduction is carried out by means of principal component analysis.
28. The computer program product of claim 23, wherein the data dimensionality reduction step comprises the following steps: (i) projecting the data set by means of robust principal component analysis; (ii) removing outliers from the data set according to their statistical distances calculated by means of one or more methods taken from the group consisting of: Hotelling's T² distance; percentiles of the empirical distribution of the reference data set; percentiles of a kernel density estimate of the distribution of the reference data set and distance from the hyperplane of a nu-SVM, estimating the support of the distribution of the reference data set; (iii) calculating the embedding projection by standard principal component analysis and projecting the cleared or the complete data set onto this basis vector system.
29. The computer program product of claims 22 to 28 further comprising a computer code that generates a document comprising said elements or subsets of the test data which have a statistical distance larger than that of a predetermined value.
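
For illustration only, the sequence of a reference data set, a test data set, a statistical distance and a predetermined limit recited above can be sketched as follows. This is a minimal sketch under stated assumptions and not the claimed implementation: the function names, the simulated data and the numerical limit are hypothetical, the distance used is a plain Hotelling-type T² distance of each test measurement vector from the reference data set, and the optional dimensionality reduction is omitted.

```python
import numpy as np

def fit_reference(reference):
    """Summarise the reference data set by its mean vector and covariance matrix."""
    mean = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    return mean, cov

def t2_distances(test, mean, cov):
    """T^2 distance of every test measurement vector (row) from the reference."""
    diff = test - mean
    solved = np.linalg.solve(cov, diff.T).T  # rows of diff multiplied by cov^-1
    return np.einsum("ij,ij->i", diff, solved)

def flag_unsuitable(test, reference, limit):
    """Return indices of test elements whose distance exceeds the given limit."""
    mean, cov = fit_reference(reference)
    distances = t2_distances(test, mean, cov)
    return np.flatnonzero(distances > limit), distances

# Simulated data (hypothetical): 200 reference and 20 test experiments, 5 variables.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 5))
test = rng.normal(size=(20, 5))
test[7] += 8.0  # one simulated failed experiment, shifted in every variable

LIMIT = 20.0  # illustrative predetermined value, not taken from the source
flagged, distances = flag_unsuitable(test, reference, LIMIT)
print("flagged experiments:", flagged)
```

In such a sketch the flagged indices would correspond to experiments indicated as unsuitable for further interpretation; in practice the limit would be derived from the chosen significance level rather than fixed by hand.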